Kovacs Adel: Final Project

PSZM21-MO-KUT-104:2 Komplex adatelemzési eljárások-Adatelemzés R-programnyelven

I have choosen the Starbucks dataset from TidyTuesday. I like coffee and I know some basic things about it (like it has caffeine, you can add milk, you can have a more concentrated espresso or a “long one”, an Americano), but I have seen some data that I am not very familiar with. The goal is to see what variables predict some attributes for different Sturbucks beverages.

Packages

I have loaded the necessary packages. I have came back here every time I needed an other one

library(tidyverse)
library(corrplot)
library(performance)
library(ggpubr)
library(broom)
library(car)
library(multcompView)

Loading the data

starbucks <- read.csv(url("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv"))
View(starbucks)

Checking and modifying variables

I checked the type of variables I have, I am going to change them if it seems necessary

str(starbucks)
## 'data.frame':    1147 obs. of  15 variables:
##  $ product_name   : chr  "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" ...
##  $ size           : chr  "short" "tall" "grande" "venti" ...
##  $ milk           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ whip           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ serv_size_m_l  : int  236 354 473 591 236 354 473 591 236 354 ...
##  $ calories       : int  3 4 5 5 3 4 5 5 3 4 ...
##  $ total_fat_g    : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ saturated_fat_g: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trans_fat_g    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cholesterol_mg : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sodium_mg      : int  5 10 10 10 5 10 10 10 5 5 ...
##  $ total_carbs_g  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fiber_g        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sugar_g        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ caffeine_mg    : int  130 193 260 340 15 20 25 30 155 235 ...

Checking for any coding mistakes or extraordinary data mainly with graphs (pairing the variables and checking two at a time - also good for exploring any possible connection or correlation). Wherever I find an extreme outlier I check the variable type and will filter out if it does not make sense

ggplot(starbucks, aes(x = serv_size_m_l, y = calories)) +
  geom_point () +
  geom_smooth(method = "lm", se = FALSE)

ggplot(starbucks, aes(x = total_fat_g, y = saturated_fat_g)) +
  geom_point()

ggplot(starbucks, aes(x = trans_fat_g, y = cholesterol_mg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

ggplot(starbucks, aes(x = sodium_mg, y = total_carbs_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

ggplot(starbucks, aes(x = fiber_g, y = sugar_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

ggplot(starbucks, aes(caffeine_mg, calories)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Decisions about the coding errors and outliers

Service size equals zero does not seem to be right, I filtered them out.

Zero fat seems to be possible, I have not changed those.

There is an item with extreme high trans fat (2mg), I filtered it out. I do not really know if it is possible, but can be a quite big effect on a fitted model, so I have decided to extract it.

Other variables and values seemed to be right.

starbucks <- starbucks %>%
  filter(serv_size_m_l > 0, trans_fat_g < 2)

I would like the whip to be a factor and rename the levels (0 becomes no, 1 becomes yes). Also checking what I did during the process

starbucks <- 
  starbucks  %>%
mutate(whip = as.factor(whip))

is.factor(starbucks$whip)
## [1] TRUE
starbucks <- starbucks %>%
  mutate(whip = fct_recode(whip, 'no' = '0', 'yes' ='1'))

levels(starbucks$whip)
## [1] "no"  "yes"

I would also like size to be a factor, because it has definite levels and it is better as factor then as character to work with

starbucks <- starbucks %>%
  mutate(size = as.factor(size))

levels(starbucks$size)
## [1] "grande" "short"  "tall"   "trenta" "venti"

Building models

I saw some not so flat lines when I made the plots, so I am checking the correlations between all the numeric variables (for this I need to filter out the two factors and the character name)

summary(starbucks)
##  product_name           size          milk        whip     serv_size_m_l  
##  Length:1115        grande:333   Min.   :0.000   no :836   Min.   :236.0  
##  Class :character   short :123   1st Qu.:1.000   yes:279   1st Qu.:354.0  
##  Mode  :character   tall  :318   Median :3.000             Median :473.0  
##                     trenta: 21   Mean   :2.529             Mean   :474.2  
##                     venti :320   3rd Qu.:4.000             3rd Qu.:591.0  
##                                  Max.   :5.000             Max.   :887.0  
##     calories      total_fat_g    saturated_fat_g   trans_fat_g    
##  Min.   :  0.0   Min.   : 0.00   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:140.0   1st Qu.: 1.50   1st Qu.: 0.300   1st Qu.:0.0000  
##  Median :220.0   Median : 4.50   Median : 3.000   Median :0.0000  
##  Mean   :234.3   Mean   : 6.35   Mean   : 3.984   Mean   :0.1226  
##  3rd Qu.:320.0   3rd Qu.:10.00   3rd Qu.: 7.000   3rd Qu.:0.2000  
##  Max.   :640.0   Max.   :28.00   Max.   :20.000   Max.   :0.5000  
##  cholesterol_mg    sodium_mg     total_carbs_g     fiber_g      
##  Min.   : 0.00   Min.   :  0.0   Min.   : 0.0   Min.   :0.0000  
##  1st Qu.: 0.00   1st Qu.: 75.0   1st Qu.:22.0   1st Qu.:0.0000  
##  Median : 5.00   Median :135.0   Median :38.0   Median :0.0000  
##  Mean   :15.63   Mean   :143.5   Mean   :38.7   Mean   :0.8897  
##  3rd Qu.:30.00   3rd Qu.:200.0   3rd Qu.:54.0   3rd Qu.:1.0000  
##  Max.   :75.00   Max.   :370.0   Max.   :96.0   Max.   :9.0000  
##     sugar_g       caffeine_mg    
##  Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:19.00   1st Qu.: 30.00  
##  Median :34.00   Median : 75.00  
##  Mean   :35.96   Mean   : 89.66  
##  3rd Qu.:49.50   3rd Qu.:145.00  
##  Max.   :89.00   Max.   :475.00
starbucks_cor <- starbucks %>%
  select(3, 5:14)
summary(starbucks_cor)
##       milk       serv_size_m_l      calories      total_fat_g   
##  Min.   :0.000   Min.   :236.0   Min.   :  0.0   Min.   : 0.00  
##  1st Qu.:1.000   1st Qu.:354.0   1st Qu.:140.0   1st Qu.: 1.50  
##  Median :3.000   Median :473.0   Median :220.0   Median : 4.50  
##  Mean   :2.529   Mean   :474.2   Mean   :234.3   Mean   : 6.35  
##  3rd Qu.:4.000   3rd Qu.:591.0   3rd Qu.:320.0   3rd Qu.:10.00  
##  Max.   :5.000   Max.   :887.0   Max.   :640.0   Max.   :28.00  
##  saturated_fat_g   trans_fat_g     cholesterol_mg    sodium_mg    
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.00   Min.   :  0.0  
##  1st Qu.: 0.300   1st Qu.:0.0000   1st Qu.: 0.00   1st Qu.: 75.0  
##  Median : 3.000   Median :0.0000   Median : 5.00   Median :135.0  
##  Mean   : 3.984   Mean   :0.1226   Mean   :15.63   Mean   :143.5  
##  3rd Qu.: 7.000   3rd Qu.:0.2000   3rd Qu.:30.00   3rd Qu.:200.0  
##  Max.   :20.000   Max.   :0.5000   Max.   :75.00   Max.   :370.0  
##  total_carbs_g     fiber_g          sugar_g     
##  Min.   : 0.0   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.:22.0   1st Qu.:0.0000   1st Qu.:19.00  
##  Median :38.0   Median :0.0000   Median :34.00  
##  Mean   :38.7   Mean   :0.8897   Mean   :35.96  
##  3rd Qu.:54.0   3rd Qu.:1.0000   3rd Qu.:49.50  
##  Max.   :96.0   Max.   :9.0000   Max.   :89.00
str(starbucks_cor)
## 'data.frame':    1115 obs. of  11 variables:
##  $ milk           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ serv_size_m_l  : int  236 354 473 591 236 354 473 591 236 354 ...
##  $ calories       : int  3 4 5 5 3 4 5 5 3 4 ...
##  $ total_fat_g    : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ saturated_fat_g: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trans_fat_g    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cholesterol_mg : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sodium_mg      : int  5 10 10 10 5 10 10 10 5 5 ...
##  $ total_carbs_g  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fiber_g        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sugar_g        : int  0 0 0 0 0 0 0 0 0 0 ...
correlations <- starbucks_cor %>%
  cor()

correlations
##                        milk serv_size_m_l  calories total_fat_g saturated_fat_g
## milk             1.00000000   -0.04110158 0.3727522   0.4933721       0.4834919
## serv_size_m_l   -0.04110158    1.00000000 0.4574138   0.1738055       0.1588493
## calories         0.37275223    0.45741382 1.0000000   0.8027326       0.7609900
## total_fat_g      0.49337212    0.17380553 0.8027326   1.0000000       0.9649116
## saturated_fat_g  0.48349190    0.15884935 0.7609900   0.9649116       1.0000000
## trans_fat_g      0.35620533    0.12601432 0.6642680   0.8389354       0.7718382
## cholesterol_mg   0.30092362    0.13880467 0.7196849   0.8680969       0.8051751
## sodium_mg        0.33162104    0.41940461 0.8304040   0.5642401       0.5531148
## total_carbs_g    0.23947357    0.54374898 0.9162816   0.5259243       0.5072440
## fiber_g          0.11222790    0.11115593 0.3136440   0.2506472       0.1869549
## sugar_g          0.22591379    0.53813187 0.8933887   0.5011799       0.4901659
##                 trans_fat_g cholesterol_mg sodium_mg total_carbs_g   fiber_g
## milk              0.3562053      0.3009236 0.3316210     0.2394736 0.1122279
## serv_size_m_l     0.1260143      0.1388047 0.4194046     0.5437490 0.1111559
## calories          0.6642680      0.7196849 0.8304040     0.9162816 0.3136440
## total_fat_g       0.8389354      0.8680969 0.5642401     0.5259243 0.2506472
## saturated_fat_g   0.7718382      0.8051751 0.5531148     0.5072440 0.1869549
## trans_fat_g       1.0000000      0.9653259 0.4474555     0.4202644 0.1308541
## cholesterol_mg    0.9653259      1.0000000 0.4897653     0.4753452 0.1137174
## sodium_mg         0.4474555      0.4897653 1.0000000     0.8291630 0.1434920
## total_carbs_g     0.4202644      0.4753452 0.8291630     1.0000000 0.2358490
## fiber_g           0.1308541      0.1137174 0.1434920     0.2358490 1.0000000
## sugar_g           0.4121353      0.4675981 0.8328058     0.9907429 0.1172173
##                   sugar_g
## milk            0.2259138
## serv_size_m_l   0.5381319
## calories        0.8933887
## total_fat_g     0.5011799
## saturated_fat_g 0.4901659
## trans_fat_g     0.4121353
## cholesterol_mg  0.4675981
## sodium_mg       0.8328058
## total_carbs_g   0.9907429
## fiber_g         0.1172173
## sugar_g         1.0000000

I am visulazing the correlations to see it better

corrplot(correlations, method = "circle")

Understanding the correlation matrix

The diagonal is every variable with itself, they are a perfect +1 correlation as they should be. As it was predictable different measurments of fat (trans_fat_g, total_fat_g, saturated_fat_g) have a strong correlation (r = 0.77 - 0.96). Carbs (total_carbs_g) and sugar (sugar_g) also have a predicted correlation (r = 0.99), the almost perfect correaltion makes sense, since the coffees containing coffee, milk, whip and sugar the only and main source of carbs is sugar. Sugar and fat has a connection with calories, moderate between trans fat and calories (r = 0.66), and strong with the others (r = 0.76 - 0.89). Just by taking a glance at the matrix, it seems that calories have the most connection to the other variables, so later, I will try to fit a model to predict calories in coffees.

Checking different size and whipped cream in relation to calories

As I have decided to work on finding out what makes a coffee with more calories, I would also like to see if it is true that whiped creamed beverages and bigger beverages have more calories

I am using independent sample t-test to see if coffees with shipped cream are more heavy in calories thank the ones without Nullhypothesis is that there is no difference, alternative hypothesis is that the whipped creamed have more calories

I am checking the normality

ggdensity(starbucks$calories, 
          main = "Density plot of calories",
          xlab = "Calories of coffees")

ggplot(starbucks, aes(x = calories, fill = whip)) +
  geom_histogram()

It seems skewed so I am running a Shapiro-Wilk test

  shapiro.test(starbucks$calories)
## 
##  Shapiro-Wilk normality test
## 
## data:  starbucks$calories
## W = 0.98115, p-value = 7.524e-11

The Shapiro-Wilk came back significant so I need a nonparametric test

I am calculating the model with robust procedure

whip_t_test <- wilcox.test(calories ~ whip, data = starbucks)
whip_t_test
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  calories by whip
## W = 27426, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

It seems like we have a significant difference in the calories of whipped and not whipped coffees. Z = 27426, p < 0.001

I checked a boxplot to see graphically which coffees have more calories

plot(calories ~ whip, data = starbucks)

It is the whipped, so we can state that coffees with whipped cream contain more calories (mean = 371.22 vs. mean of the no whipped cream coffees = 188.54)

Now I take a look at the size of a coffee related to its calories I am bulding a one-way ANOVA model to find out if there are any differences between the groups, and first I check the assumptions My nullhypothesis is that there are no differences My alternative hypothesis is that there are differences

Checking for outliers graphically

ggboxplot(starbucks, x = "size", y = "calories",
          ylab = "Calories", xlab = "Size")

outlier <- starbucks %>%
  filter (size == "trenta", calories < 10)
outlier
##         product_name   size milk whip serv_size_m_l calories total_fat_g
## 1 Cold Brewed Coffee trenta    0   no           887        5           0
##   saturated_fat_g trans_fat_g cholesterol_mg sodium_mg total_carbs_g fiber_g
## 1               0           0              0        25             0       0
##   sugar_g caffeine_mg
## 1       0         330

I have found an outlier in size “trenta” which turned out to be a cold brewed coffee and it makes sense, since as far as I know it does not contain any milk or sugar, so I left it in the data

I am checking the homogenity of variances as an assumption.

I would like them to be not significantly different (I have checked both the Levene’s and the Bartlett)

leveneTest(calories ~size, starbucks)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    4  33.788 < 2.2e-16 ***
##       1110                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bartlett.test(calories ~size, starbucks)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  calories by size
## Bartlett's K-squared = 139.67, df = 4, p-value < 2.2e-16

Unfortunately they are significantly different I need to use nonparametric or robust ways of ANOVA (I found a helpful article here)

starbucks_anova <- kruskal.test(calories ~ size, data = starbucks)
starbucks_anova
## 
##  Kruskal-Wallis rank sum test
## 
## data:  calories by size
## Kruskal-Wallis chi-squared = 273.35, df = 4, p-value < 2.2e-16

I found a significant model (H(4) = 273.35, p < 0.001)

To find out which groups differ from each other I have run a pairwise t-test with Bonferroni correction.

pairwise.t.test(starbucks$calories, starbucks$size, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  starbucks$calories and starbucks$size 
## 
##        grande  short   tall    trenta 
## short  < 2e-16 -       -       -      
## tall   1.5e-11 1.5e-06 -       -      
## trenta 0.13    0.17    1.00    -      
## venti  1.0e-13 < 2e-16 < 2e-16 2.3e-06
## 
## P value adjustment method: bonferroni

It looks like short and grande, tall and grande, tall and short are different in calories. Also venti versus every else size are different. (each p < 0.001). Trenta only differs from venti.

I have produced a boxplot to visualize it

boxplot(calories ~ size, data = starbucks)

Fitting a complex model to predict calories in a coffee

Lastly I wanted to fit a complex model to predict calories in a coffee. I contained every variable that seemed to be correlated to calories.

First I built a less complex model (with obvious elements like sugar and whip) and checked the assumptions.

model1 <- lm(calories ~ whip + sugar_g + total_fat_g, starbucks)
summary(model1)
## 
## Call:
## lm(formula = calories ~ whip + sugar_g + total_fat_g, data = starbucks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.040 -16.222  -4.301   9.134 114.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.14945    1.42020  13.484   <2e-16 ***
## whipyes     -21.27456    2.53915  -8.379   <2e-16 ***
## sugar_g       4.03154    0.03782 106.608   <2e-16 ***
## total_fat_g  11.88029    0.19598  60.619   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.03 on 1111 degrees of freedom
## Multiple R-squared:  0.9684, Adjusted R-squared:  0.9683 
## F-statistic: 1.136e+04 on 3 and 1111 DF,  p-value: < 2.2e-16
check_model(model1)

model1 %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()
## # A tibble: 6 × 10
##   calories whip  sugar_g total_fat_g .fitted .resid    .hat .sigma .cooksd
##      <int> <fct>   <int>       <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1      390 yes        45          10    298.   91.9 0.00461   23.9  0.0170
## 2      430 yes        58          10    351.   79.5 0.00512   23.9  0.0141
## 3      230 yes        47          11    318.  -88.0 0.00418   23.9  0.0141
## 4      450 no         69          17    499.  -49.3 0.0121    24.0  0.0130
## 5      430 yes        37          18    361.   69.1 0.00551   24.0  0.0115
## 6      400 yes        35          16    329.   70.9 0.00469   23.9  0.0103
## # … with 1 more variable: .std.resid <dbl>

Althought independent variables does not correlate highly, we know from the correlation matrix, that they do correlate with calories (dependent variable). If it wouldn’t I would compare this with the following more complex model:

model2 <- lm(calories ~ whip + sugar_g + total_fat_g + total_carbs_g, starbucks)

I am looking for variables that are not too highly correlated with calories.

First a simple model

model3 <- lm(calories ~ milk + serv_size_m_l, starbucks)
summary(model3)
## 
## Call:
## lm(formula = calories ~ milk + serv_size_m_l, data = starbucks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -318.26  -76.80  -10.22   76.57  324.65 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -39.81520   11.60526  -3.431 0.000624 ***
## milk           31.62288    1.93165  16.371  < 2e-16 ***
## serv_size_m_l   0.40933    0.02071  19.765  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 107.9 on 1112 degrees of freedom
## Multiple R-squared:  0.3628, Adjusted R-squared:  0.3617 
## F-statistic: 316.6 on 2 and 1112 DF,  p-value: < 2.2e-16
check_model(model3)

The assumptions are met and I got a significant model (F(1112,2)=316.6, p < 0.001), where all of the predictors are significant (p < 0.001). Adjusted R^2 is 0.362, meaning that my predictors explain 36% of the calories variances.

I am bulding a more complex model

model4 <- lm(calories ~ milk + serv_size_m_l + fiber_g, starbucks)
summary(model4)
## 
## Call:
## lm(formula = calories ~ milk + serv_size_m_l + fiber_g, data = starbucks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -297.22  -71.47  -14.26   65.62  343.29 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -41.05071   11.16212  -3.678 0.000247 ***
## milk           29.52137    1.87076  15.780  < 2e-16 ***
## serv_size_m_l   0.38700    0.02005  19.297  < 2e-16 ***
## fiber_g        19.26499    2.01721   9.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.8 on 1111 degrees of freedom
## Multiple R-squared:  0.4111, Adjusted R-squared:  0.4096 
## F-statistic: 258.6 on 3 and 1111 DF,  p-value: < 2.2e-16
check_model(model4)

The assumptions are met again and I have a significant model (f(1111, 3) = 258.6, p < 0.001), where all service size, milk and fiber are significant (p < 0.001). Adjusted R^2 got higher, now it is 41% explained of the calories variances.

I am comparing the models without and with caffeine.

compare <- anova(model3, model4)
compare
## Analysis of Variance Table
## 
## Model 1: calories ~ milk + serv_size_m_l
## Model 2: calories ~ milk + serv_size_m_l + fiber_g
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1   1112 12947653                                  
## 2   1111 11965348  1    982305 91.208 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Turns out that model4 with fiber in it is a significantly better choice to predict calories in a coffe. (F(1) = 91.208, p < 0.001)

A plot about the calories of different sized and whipped coffees

I wanted to see the relation between size, whipped cream and calories of coffees.

starbucks2 <- starbucks %>%
  add_count(calories) %>%
  group_by(size, whip) %>%
  mutate(mean_cal = mean(n))

ggplot(starbucks2, aes(x = reorder (size, -mean_cal), y = mean_cal, fill = whip)) +
  geom_col(position = "dodge") +
  scale_fill_brewer() +
  theme_classic() +
  labs (title = "Mean calories of Starbucks coffees by size and whipcream", x = "Size", y = "Calories") +
  theme (legend.position ="bottom")

It is interesting, because with the t-test I saw that whipped coffees have more calories, but here in subgroups it is not that obvious.

I have also made a plot about average calories in different sized Starbucks drinks

starbucks_to_plot <- starbucks %>%
  group_by(size) %>%
  count(mean(calories))

ggplot(starbucks_to_plot, aes(x = reorder(size, -n), y = n, fill = size)) +
  geom_col() +
  scale_fill_brewer() +
  theme(legend.position = "None") +
  labs(title = "Mean calories of Starbucks coffees by size", x = "Size", y = "Calories")